The dataset was taken from Kaggle: https://www.kaggle.com/ruchi798/movies-on-netflix-prime-video-hulu-and-disney
The dataset is an amalgamation of movie data from Netflix, Prime Video, Hulu and Disney+, together with IMDb and Rotten Tomatoes ratings.
Goal: predict whether a title is available on Netflix or not
Assumption: rating, year of release, genre, runtime, country, ... might differ between streaming platforms and therefore be useful for the prediction
import pandas as pd
import numpy as np
import statistics
import seaborn as sns
import matplotlib.pyplot as plt
import pycountry_convert as pc
from pandas_profiling import ProfileReport
import warnings
warnings.filterwarnings('ignore')
def readData(filename):
    df = pd.read_csv(filename, sep=',', encoding='latin-1', index_col='Unnamed: 0')
    profile = ProfileReport(df, title="Pandas Profiling Report")
    #profile.to_file("pandas_report.html")
    return df
df = readData('MoviesOnStreamingPlatforms_updated.csv')
Delete columns or rows that do not contain enough useful data (e.g. NaN values) and whose missing values cannot easily be replaced without further investigation.
# compare null values to total amount of values per column
def compareNanToTotal(df):
    return pd.concat([df.isnull().sum(), df.count()], keys=['null_values', 'total_values'], axis=1)
compareNanToTotal(df)
def dropCols(df):
    del df['Rotten Tomatoes']
    del df['Type']
def dropRows(df):
    columns = ['IMDb', 'Directors', 'Genres', 'Country', 'Language'] # there are too many directors for one-hot encoding
    for col in columns:
        df = df[pd.notnull(df[col])]
    df.drop_duplicates(inplace=True) # drop duplicate rows
    return df
def replaceNan(df):
    columns = ['Runtime']
    for col in columns:
        median = df[col].median() # pandas skips NaN values here, unlike statistics.median
        df[col] = df[col].replace(np.nan, median) # fill NaN values with the median (more robust against outliers)
    df['Age'] = df['Age'].replace(np.nan, 'all') # we assume there is no age constraint for NaN values in the Age column
    return df
dropCols(df)
df = dropRows(df)
df = replaceNan(df)
compareNanToTotal(df)
df = pd.get_dummies(df, prefix='age', prefix_sep='_', columns=['Age']) # one-hot encode the Age column
df.tail()
genres = ['horror', 'thriller', 'sci-fi', 'documentary', 'action', 'animation', 'comedy', 'western', 'drama', 'fantasy', 'romance', 'music', 'musical', 'news',
'adventure', 'sport', 'reality-tv', 'history', 'family', 'biography', 'mystery', 'war', 'crime', 'short', 'film-noir', 'talk-show']
for genre in genres:
    df['genre_' + genre] = 0
df.reset_index(drop=True, inplace=True)
for index, row in df.iterrows():
    if ',' in row.Genres:
        row_genres = row.Genres.split(',') # use a new name to avoid shadowing the genres list above
        for genre in row_genres:
            df.loc[df.index[index], 'genre_' + genre.lower()] = 1
    else:
        df.loc[df.index[index], 'genre_' + row.Genres.lower()] = 1
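The row-wise loop above can also be expressed with pandas' built-in `Series.str.get_dummies`. A minimal sketch on a toy `Genres` column (the column name matches the dataset, the sample values are made up):

```python
import pandas as pd

# toy frame mimicking the Genres column (assumed to hold comma-separated genre names)
df = pd.DataFrame({'Genres': ['Comedy,Drama', 'Horror', 'Drama']})

# one indicator column per genre, without an explicit Python loop
genre_dummies = df['Genres'].str.get_dummies(sep=',')
genre_dummies.columns = ['genre_' + c.lower() for c in genre_dummies.columns]
df = df.join(genre_dummies)
print(df['genre_drama'].tolist())  # → [1, 0, 1]
```

This avoids the per-row `df.loc` writes, which are slow on larger frames.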
Retrieve the continent name from the country first, then perform one-hot encoding.
def check_continent(data):
    c_entry = []
    # countries that pycountry_convert cannot resolve are mapped manually
    eu_special = ('West Germany', 'East Germany', 'Holy See (Vatican City State)',
                  'Czechoslovakia', 'Isle Of Man', 'Reunion', 'Soviet Union',
                  'Federal Republic of Yugoslavia', 'Yugoslavia',
                  'Serbia and Montenegro', 'Kosovo')
    for entry in data:
        if str(entry).strip() != 'nan':
            continents = set()
            for country in entry.split(','):
                if country in eu_special:
                    continents.add('EU')
                elif country == 'U.S. Virgin Islands':
                    continents.add('NA')
                elif country == 'Korea':
                    continents.add('AS')
                elif country == 'The Democratic Republic Of Congo' or "Ivoire" in country:
                    continents.add('AF')
                else:
                    continents.add(pc.country_alpha2_to_continent_code(
                        pc.country_name_to_country_alpha2(country, cn_name_format="default")))
            c_entry.append(','.join(continents))
        else:
            c_entry.append(None)
    return c_entry
df['Continent'] = check_continent(df['Country'])
continents = ['EU', 'AS', 'AF', 'NA', 'SA', 'OC']
for continent in continents:
    df[continent] = 0
df.reset_index(drop=True, inplace=True)
for index, row in df.iterrows():
    if row.Continent is not None:
        if ',' in row.Continent:
            continents = row.Continent.split(',')
            for continent in continents:
                df.loc[df.index[index], continent] = 1
        else:
            df.loc[df.index[index], row.Continent] = 1
As visualized in the pandas profiling report, there is a negative correlation between the columns Netflix and Prime Video, which leads to the conclusion that the provided services differ a lot. For an even better comparison, titles that are available on both Netflix and Prime Video (323) are removed during preprocessing. There are 2915 unique titles on Netflix and 11023 titles on Prime Video. The fewest titles are available on the smaller platforms Hulu and Disney+.
We combine the unique titles on Netflix with the unique titles on Prime Video.
# remove movies that are both on netflix and prime
netflix = df.loc[(df['Netflix'] == 1) & (df['Prime Video'] == 0)]
print(netflix.shape)
netflix.head()
netflix_prime = df.loc[(df['Netflix'] == 1) & (df['Prime Video'] == 1)]
print(netflix_prime.shape) # 322 titles in common --> drop them, or keep them in favour of Netflix?
netflix_prime.head()
prime = df.loc[(df['Netflix'] == 0) & (df['Prime Video'] == 1)]
print(prime.shape)
prime.head()
hulu = df.loc[(df['Hulu'] == 1)]
print(hulu.shape)
disney = df.loc[(df['Disney+'] == 1)]
print(disney.shape)
df_comb = pd.concat([prime, netflix]) # DataFrame.append is deprecated in newer pandas versions
df_comb.head()
# init with 0 because there are more movies on Prime Video than on Netflix
df_comb['onNetflix'] = 0
for index, value in df_comb.Netflix.items():
    if value == 1:
        df_comb.loc[index, 'onNetflix'] = 1 # items() already yields the index label, so no positional lookup is needed
# since overlapping titles were removed, this is equivalent to df_comb['onNetflix'] = df_comb['Netflix']
df = df_comb.sort_values(by='ID', ascending=True) # sort by ID ascending
df.head()
# count movies per column (platforms or continent)
def movies_count(col, count=False):
    if not count:
        print('Movies in {} are {}'.format(col, df[col].sum()))
    else:
        return df[col].sum()
# movies per platform
labels = 'Netflix', 'Hulu', 'Prime Video', 'Disney+'
size = [movies_count('Netflix', count=True),
        movies_count('Hulu', count=True),
        movies_count('Prime Video', count=True),
        movies_count('Disney+', count=True)]
explode = (0.1, 0.1, 0.1, 0.1)
# plotting
fig1, ax1 = plt.subplots()
ax1.pie(size,
        labels=labels,
        autopct='%1.1f%%',
        explode=explode,
        shadow=True,
        startangle=100)
ax1.axis('equal') # call the method; assigning to ax1.axis would overwrite it
plt.show()
# movies per continent
labels = 'EU', 'AS', 'AF', 'NA', 'SA', 'OC'
size = [movies_count('EU', count=True),
        movies_count('AS', count=True),
        movies_count('AF', count=True),
        movies_count('NA', count=True),
        movies_count('SA', count=True),
        movies_count('OC', count=True)]
explode = (0.1, 0.1, 0.1, 0.1, 0.1, 0.1)
#plotting
fig1, ax1 = plt.subplots()
ax1.pie(size,
        labels=labels,
        autopct='%1.1f%%',
        explode=explode,
        shadow=True,
        startangle=100)
ax1.axis('equal') # call the method; assigning to ax1.axis would overwrite it
plt.show()
# do languages correlate with continent plot?
s = df['Language'].str.split(',', expand=True).stack() # one row per individual language
s.index = s.index.droplevel(-1) # drop the extra index level created by stack()
s.name = 'Language'
df_language = pd.DataFrame(s)
df_language['Language'].value_counts()[:10].plot(kind='barh')
plt.show()
# checking years -> there are movies from 1900 onwards, but most of the data is from 2000 onwards
sns.histplot(df['Year'])
plt.show()
# checking imdb ratings
sns.histplot(df['IMDb'])
plt.show()
print('Minimum IMDb rating: ', df['IMDb'].min())
print('Maximum IMDb rating: ', df['IMDb'].max())
imdb_top_ten = df.nlargest(10, 'IMDb')
imdb_top_ten
# movies per genre
genres_columns = [col for col in df.columns if 'genre_' in col]
data = []
for col in genres_columns:
    size = len(df.loc[df[col] == 1])
    data.append(size)
genre_value_counts = pd.DataFrame(data, genres_columns)
plt.figure(figsize=(10,10))
ax = sns.barplot(x = genres_columns, y = genre_value_counts[0])
ax.set(xlabel='genres', ylabel='movies')
ax.set_title('Movies per Genre')
plt.xticks(rotation=90)
plt.show()
plt.figure(figsize=(10,10))
sns.histplot(df['Runtime'])
plt.show()
# find the directors with the largest number of movies
directors = df['Directors'].value_counts()
directors = directors.loc[directors > 10]
plt.figure(figsize=(10,10))
directors.plot(kind='bar')
plt.show()
Predict whether a title is available on Netflix or not (i.e. available on Prime Video instead).
# predict if on netflix or not
y = df.onNetflix
# X = all remaining feature columns
X = df.drop(['ID', 'onNetflix', 'Netflix', 'Prime Video', 'Genres', 'Continent', 'Title', 'Directors', 'Country', 'Language'], axis=1)
X.head()
from sklearn.metrics import explained_variance_score, max_error, mean_squared_error, r2_score
def evaluateModel(model, train_x, test_x, train_y, test_y, model_name):
    train_predict = model.predict(train_x)
    test_predict = model.predict(test_x)
    r2 = r2_score(train_y, train_predict)
    error = max_error(train_y, train_predict)
    mse = mean_squared_error(train_y, train_predict)
    print("{0} train set metrics: ".format(model_name))
    print("R-Squared: {0}\nMax Error: {1}\nMSE: {2}\n".format(r2, error, mse))
    r2 = r2_score(test_y, test_predict)
    error = max_error(test_y, test_predict)
    mse = mean_squared_error(test_y, test_predict)
    print("{0} test set metrics: ".format(model_name))
    print("R-Squared: {0}\nMax Error: {1}\nMSE: {2}\n".format(r2, error, mse))
    print("{0} \nAccuracy: {1}".format(model_name, model.score(test_x, test_y))) # note: score() is accuracy for classifiers but R^2 for regressors
    return train_predict, test_predict
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
def standardize_data(data):
    scaler = StandardScaler()
    scaler.fit(data)
    return scaler.transform(data)
# fit each scaler on the training data only, then apply the same transformation to the test data
min_max_scaler = MinMaxScaler().fit(X_train)
train_normalized = min_max_scaler.transform(X_train)
test_normalized = min_max_scaler.transform(X_test)
std_scaler = StandardScaler().fit(X_train)
train_standardized = std_scaler.transform(X_train)
test_standardized = std_scaler.transform(X_test)
We use logistic regression as a classification model.
Although the accuracy of around 80% is promising, the negative R2 value indicates that the chosen model fits the data poorly.
The confusion matrix tells us that we have 3558+97 correct predictions and 868+77 incorrect predictions.
From the precision and recall values we can gather that the classifier is better at finding class 0 (Prime Video) than class 1 (Netflix).
A possible explanation for the high accuracy despite the diverging precision and recall values is the imbalance in the dataset, where the majority of titles belongs to Prime Video (around 11,000) as opposed to a minority belonging to Netflix (around 3,000).
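The quoted scores can be reproduced directly from the four confusion-matrix counts. A small sketch; since the text only gives the sums, assigning the 77 errors to false positives and the 868 to false negatives is an assumption for illustration:

```python
def classification_metrics(tn, fp, fn, tp):
    """Derive accuracy, precision and recall (for the positive class) from confusion-matrix counts."""
    accuracy = (tn + tp) / (tn + fp + fn + tp)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall

# counts as reported above; the fp/fn split is an illustrative assumption
acc, prec, rec = classification_metrics(tn=3558, fp=77, fn=868, tp=97)
print(round(acc, 3), round(prec, 3), round(rec, 3))  # → 0.795 0.557 0.101
```

The low recall for the positive class is exactly the "bad at finding Netflix" effect described above.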
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression(max_iter=400).fit(X_train, y_train) # fit logistic model
train_predict, test_predict = evaluateModel(log_reg, X_train, X_test, y_train, y_test, model_name="Logistic Regression")
from sklearn.model_selection import cross_val_score
scores = cross_val_score(LogisticRegression(max_iter=400), X_train, y_train, cv=5)
# the training data is split into k=5 folds
# each fold serves once as the validation set while the model is trained on the remaining k-1 folds
# this gives a more robust performance estimate and helps detect overfitting
print(scores) # returns 5 scores
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2)) # little improvement of model when using cv
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, test_predict) # avoid shadowing the imported function
print(cm)
from sklearn.metrics import classification_report
print(classification_report(y_test, test_predict))
The Decision Tree Classifier from sklearn is applied to this classification problem.
Starting from the root node, we can tell how a decision/prediction is made by following the branches down to the leaves in the visualization. Decisions are based on thresholds for the different columns of the training data: if a value is smaller/larger than x, decision y is made.
A couple of iterations were made to find the best depth for the decision tree. Both the information gain and the Gini index criterion (which measure the quality of a split) were used to determine the best tree depth, i.e. the depth resulting in the best score.
When examining the decision tree of depth 5 using the information gain criterion, we can see that 'genre_animation', 'genre_family', 'age_7+', 'Runtime', 'Hulu', 'NA' (North America) and 'IMDb' were important factors for the outcome of the classification.
E.g. the classification for Netflix appears to be influenced by kids' content, as it uses the indicators 'genre_family' and 'age_7+'.
E.g. the classification for Prime shows that newer titles released after 2018 that are on Hulu were also available on Prime Video, but not on Netflix.
When examining the decision tree of depth 6 using the Gini index criterion, we can see that 'age_18+', 'genre_history', 'genre_comedy', 'genre_action', 'Hulu', 'NA' (North America), 'genre_crime', 'IMDb' and 'Runtime' were important factors for the outcome of the classification.
E.g. similar to before, Hulu seems to be a deciding factor for classifying whether titles are available on Netflix or on Prime Video. Furthermore, all titles that are not on Hulu, have a runtime longer than 143 minutes and belong to the genre crime are assigned to Netflix.
Looking at the performance metrics of the classification, we see a recurring pattern: good accuracy but a bad R2 score.
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
# build decision tree classifier to predict whether title is on netflix or prime
def makeCV(depth, criterion, *args):
    clf = DecisionTreeClassifier(random_state=0, criterion=criterion, max_depth=depth)
    clf.fit(X_train, y_train)
    scores = cross_val_score(clf, X_train, y_train, cv=10) # usually between 5 and 10 folds
    if args:
        plt.figure(figsize=(args[0], args[1]))
        tree.plot_tree(clf,
                       feature_names=X_train.columns,
                       class_names=['onPrime', 'onNetflix'],
                       filled=True,
                       rounded=True,
                       fontsize=14)
        plt.savefig('./plots/tree/tree_' + str(depth) + '_' + criterion + '.jpg', format='jpg', bbox_inches="tight")
    return scores.mean()
ig_scores = []
gini_scores = []
for depth in range(1, 11):
    ig_scores.append((depth, makeCV(depth, 'entropy')))
    gini_scores.append((depth, makeCV(depth, 'gini')))
for ig_score, gini_score in zip(ig_scores, gini_scores):
    print('Depth:', ig_score[0], 'IG-Score:', ig_score[1], 'Gini-Score:', gini_score[1])
print('Best tree depth for information gain criterion: ', max(ig_scores, key=lambda item: item[1])[0])
print('Best tree depth for gini index criterion: ', max(gini_scores, key=lambda item: item[1])[0])
# choose depth 5 for easier visualization and interpretation, as there is only a small change in accuracy compared to the best depth of 9
print('Accuracy for tree with depth 5 using information gain criterion: {:.3f}'.format(makeCV(5, 'entropy', 70, 10)))
print('Accuracy for tree with depth 6 using gini index criterion: {:.3f}'.format(makeCV(6, 'gini', 120, 15)))
clf = DecisionTreeClassifier(random_state=0, criterion = 'gini', max_depth = 6)
clf.fit(X_train, y_train)
train_predict, test_predict = evaluateModel(clf, X_train, X_test, y_train, y_test, model_name="Decision Tree")
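The influential columns discussed above can also be read off programmatically via the classifier's `feature_importances_` attribute. A minimal sketch on made-up stand-in data (the column names mirror the dataset, the values are invented; the notebook itself would reuse its fitted `clf` and `X_train`):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# toy stand-in for X_train / y_train
X = pd.DataFrame({'IMDb':    [5.1, 7.3, 6.0, 8.2, 4.4, 7.9],
                  'Runtime': [90, 120, 100, 140, 85, 130],
                  'Hulu':    [0, 1, 0, 1, 0, 1]})
y = [0, 1, 0, 1, 0, 1]
clf = DecisionTreeClassifier(random_state=0, max_depth=3).fit(X, y)

# rank the columns by how much they contributed to the splits
importances = pd.Series(clf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances)
```

This complements the plotted trees: the ranking summarizes which features drive the splits without reading the visualization.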
The MLPRegressor from sklearn is used to build multiple neural networks with varying activation functions and layer sizes.
Examining the results, we can see that the logistic function performs better with fewer layers, as its score seems to decrease as the number of hidden layers increases.
The tanh and relu activation functions seem to follow the same trends when the number of layers and neurons per layer changes.
When comparing the overall scores for normalized and standardized data, tanh and logistic perform slightly better with standardized data, while relu performs better with normalized data.
It is hard to tell what the "right" number of layers and neurons is, as none of the scores are very good, the best being close to 30%. The results could be influenced by local minima when using the wrong learning rate, and too many layers or neurons could result in overfitting.
The standardized data provides slightly better results than the normalized data.
from sklearn.neural_network import MLPRegressor
from sklearn.datasets import make_regression
def makeNN(data, activation, size):
    # adjust the learning rate and the maximum number of iterations as needed
    regr = MLPRegressor(activation=activation, solver='sgd', learning_rate_init=0.04, hidden_layer_sizes=size, max_iter=5000)
    regr.fit(data, y_train)
    return regr
def testNN(sizes, train_x, test_x):
    for activation in ['tanh', 'relu', 'logistic']:
        for size in sizes:
            regr = makeNN(train_x, activation, size)
            score = regr.score(test_x, y_test)
            print("{0}: Score: {1} - Hidden Layer Size: {2}".format(activation, score, size))
        print()
    print()
def testLayerSizes(train_x, test_x):
    sizes = [(2,), (5,), (10,), (20,), (30,), (100,)]
    print('Testing for 1 Hidden Layer:')
    testNN(sizes, train_x, test_x)
    sizes = [(2, 5), (5, 2), (10, 20), (20, 10), (50, 50)]
    print('Testing for 2 Hidden Layers:')
    testNN(sizes, train_x, test_x)
    sizes = [(2, 2, 2), (10, 10, 10), (10, 20, 30), (20, 20, 20)] #, (50, 40, 30, 20, 10, 5)]
    print('Testing for 3 Hidden Layers:')
    testNN(sizes, train_x, test_x)
    print('Testing for 4 Hidden Layers:')
    testNN([(2, 4, 6, 8)], train_x, test_x)
testLayerSizes(train_standardized, test_standardized)
testLayerSizes(train_normalized, test_normalized)
regr = makeNN(train_standardized, 'tanh', (10,))
score = regr.score(test_standardized, y_test)
train_predict, test_predict = evaluateModel(regr, train_standardized, test_standardized, y_train, y_test, model_name="Neural Network")
regr = makeNN(train_normalized, 'tanh', (10, 20, 30))
score = regr.score(test_normalized, y_test)
train_predict, test_predict = evaluateModel(regr, train_normalized, test_normalized, y_train, y_test, model_name="Neural Network")
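Since the target is binary, sklearn's `MLPClassifier` may be a more natural choice than `MLPRegressor`: its `score` method reports accuracy rather than R2. A hedged sketch on synthetic data standing in for the real feature matrix (the layer size and activation mirror the experiments above):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# synthetic binary problem standing in for the Netflix/Prime labels
X, y = make_classification(n_samples=400, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=42)

scaler = StandardScaler().fit(X_tr)   # fit on the training split only ...
X_tr_s = scaler.transform(X_tr)       # ... then apply the identical
X_te_s = scaler.transform(X_te)       # transformation to the test split

clf = MLPClassifier(activation='tanh', hidden_layer_sizes=(10,), max_iter=2000, random_state=0)
clf.fit(X_tr_s, y_tr)
acc = clf.score(X_te_s, y_te)         # mean accuracy, not R^2
print(round(acc, 2))
```

Framing the task as classification also makes the network's output a class probability via `predict_proba`, which is easier to interpret than a regressed score.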
The KNeighborsRegressor from sklearn is used as a non-linear model.
The best value of k is determined within the range of 2-30 using cross-validation and different input data (raw data without preprocessing, standardized data, normalized data).
At first glance it seems that the higher the number of neighbors (k), the better the results. However, at a certain point increasing the number of neighbors no longer has a significant effect, suggesting that the ideal number of neighbors lies within the range of 20-30.
Comparing the scores, we see that the Manhattan distance and the Euclidean distance share similar metrics on the training data; on the test metrics, however, the Manhattan distance performs better. The accuracy is also better for the Manhattan distance.
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold
def findBestK(train_x, train_y):
    knn = KNeighborsRegressor()
    grid_search = GridSearchCV(estimator=knn, cv=KFold(n_splits=10, shuffle=True),
                               param_grid={"n_neighbors": [2, 5, 10, 15, 20, 25, 30]},
                               scoring='neg_mean_squared_error')
    grid_search.fit(train_x, train_y)
    print("Best: {0} with score: {1}".format(grid_search.best_params_, grid_search.best_score_))
    for mean, std, params in zip(grid_search.cv_results_['mean_test_score'], grid_search.cv_results_['std_test_score'], grid_search.cv_results_['params']):
        print("%r: Mean score: %0.3f (+/- %0.3f)" % (params, mean, std))
    return grid_search.best_estimator_
print("CV without preprocessing:")
kNN = findBestK(X_train, y_train)
print()
print("CV with standardized data:")
standardized_data = standardize_data(X_train)
kNN_std = findBestK(standardized_data, y_train)
print()
print("CV with normalized data:")
kNN_norm = findBestK(train_normalized, y_train) # use a separate name so the standardized estimator is not overwritten
print()
# best results with 30 neighbours
# euclidean distance
knn = KNeighborsRegressor(n_neighbors=30)
knn.fit(train_normalized, y_train)
train_predict, test_predict = evaluateModel(knn, train_normalized, test_normalized, y_train, y_test, model_name="k-Nearest Neighbours") # evaluate on the normalized test data to match the training input
# manhattan distance
knn = KNeighborsRegressor(n_neighbors=30, p=1)
knn.fit(train_normalized, y_train)
train_predict, test_predict = evaluateModel(knn, train_normalized, test_normalized, y_train, y_test, model_name="k-Nearest Neighbours (Manhattan)")
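The Euclidean/Manhattan comparison could also be run with `KNeighborsClassifier`, whose `score` reports accuracy directly and therefore matches the binary target better than a regressor's R2. A sketch on synthetic stand-in data (the neighbor count mirrors the choice above):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# synthetic stand-in for the (normalized) feature matrix
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=0)

scores = {}
for p, name in [(2, 'euclidean'), (1, 'manhattan')]:  # p selects the Minkowski metric
    knn = KNeighborsClassifier(n_neighbors=30, p=p).fit(X_tr, y_tr)
    scores[name] = knn.score(X_te, y_te)  # accuracy, directly comparable across metrics
    print(name, round(scores[name], 2))
```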
Principal component analysis (PCA) is used to reduce the dimensionality from 42 to 35. More precisely, 95% of the variance in the data is explained by these 35 components. Using the standardized data as input is important for PCA, since it is a variance-maximizing procedure.
import matplotlib.colors as mcolors
from mpl_toolkits.mplot3d import Axes3D # noqa: F401 unused import
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
pca = PCA(0.95)
transformed = pd.DataFrame(pca.fit_transform(train_standardized))
print("Number of components needed to explain 95% of variance: {0}".format(len(pca.components_)))
# Assess the number of needed dimensions by plotting the cumulative explained variance
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance');
plt.show()
from pandas.plotting import scatter_matrix
sm = scatter_matrix(transformed.iloc[:, :5], c=y_train, s=10)
t-SNE is used to reduce the dimensionality of the data to 2 dimensions. We render multiple scatter plots using perplexity values between 5 and 50 and 2000 iterations. For testing purposes, a random sample is drawn and used for faster computation of t-SNE.
Perplexity values in the range of 10 to 20 seem to provide the best visual results, although it is hard to tell, as the plots show multiple data clusters with mixed distributions of both classes that therefore cannot clearly be labeled as "Netflix" or "Prime Video". Unfortunately, there is no visible distinct "Netflix" cluster, which corresponds to the results of the previous prediction models. Perplexities in the range of 25-50 seem to lead to more scattered data and less distinct clusters.
The plots show that standardizing or normalizing the data has a huge impact on the clusters that form; the plots for standardized and normalized data themselves do not differ greatly.
def plotTSNE(data, target, title, n_iter=2000):
    perplexities = [5, 10, 15, 20, 25, 30, 35, 50]
    for idx, p in enumerate(perplexities):
        tsne = TSNE(n_components=2, perplexity=p, n_iter=n_iter)
        tsne_obj = tsne.fit_transform(data)
        tsne_df = pd.DataFrame({'X': tsne_obj[:, 0], 'Y': tsne_obj[:, 1], 'target': target})
        sns.scatterplot(x="X", y="Y", hue="target", alpha=0.3, palette=['blue', 'red'], data=tsne_df, legend="full")
        plt.title("Perplexity: {0}".format(p))
        plt.savefig('./plots/' + title + str(p) + '.png') # save before show, otherwise the figure has already been cleared
        plt.show()
# draw a random subset of the data for faster computation of t-SNE
import random
N = 1000
indices = random.sample(range(len(X_train)), N) # sample positions 0..n-1 (the original range started at 1)
X_sample_standardized = pd.DataFrame(train_standardized).iloc[indices]
X_sample_normalized = pd.DataFrame(train_normalized).iloc[indices] # use the normalized data here, not the standardized data
X_sample = X_train.iloc[indices]
y_sample = y_train.iloc[indices]
print("Data without preprocessing: ")
plotTSNE(X_sample, y_sample, 'tsne/tsne_')
print("Standardized data: ")
plotTSNE(X_sample_standardized, y_sample, 'tsne/tsne_std_')
print("Normalized data: ")
plotTSNE(X_sample_normalized, y_sample, 'tsne/tsne_norm_')
# now use complete dataset for T-SNE with chosen perplexity value
tsne = TSNE(n_components=2, perplexity=15, n_iter=2000)
tsne_data = tsne.fit_transform(train_normalized)
tsne_df = pd.DataFrame(tsne_data, columns=['x', 'y'])
We use the first 2 components from the PCA as input for t-SNE, which results in clusters that are more connected.
def pca2(data):
    pca = PCA(n_components=2)
    pca_result = pca.fit_transform(data)
    return pd.DataFrame(data=pca_result, columns=['pca-one', 'pca-two'])
def plotPCA_TSNE(data, target):
    pca_df = pca2(data)
    return plotTSNE(pca_df, target, 'pca+tsne/tsne_')
plotPCA_TSNE(pd.DataFrame(X_sample_standardized), y_sample)
k-Means clustering is applied to the t-SNE data. Multiple scatter plots for different values of k are plotted.
As the data points are very close to each other, telling the different clusters apart visually is hard. Judging from the resulting plots, a reasonable k could lie within the range of 20-30.
from sklearn.cluster import KMeans
def createKMeans(data, k):
    print("\nNumber of clusters: {0}".format(k))
    kmeans = KMeans(n_clusters=k, init='random')
    kmeans.fit(data[['x', 'y']]) # fit on the embedding columns only; later calls append label columns to data
    countPointsPerCluster(data, kmeans, 'kmeans')
    visualizeClusters(data, kmeans, k, True)
def countPointsPerCluster(data, model, cluster_type):
    print("Points per cluster:")
    data[cluster_type] = model.labels_
    clusters = data[cluster_type].value_counts()
    for idx, value in enumerate(clusters):
        print("Cluster {0}: {1}".format(idx, value))
def visualizeClusters(data, model, k, isKmeans=False):
    colors = sns.color_palette("hls", k)
    sns.scatterplot(x="x", y="y", hue=model.labels_, palette=colors, data=data, legend="full") # plot the passed-in data instead of the global tsne_df
    if isKmeans:
        centers = model.cluster_centers_
        plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5)
        plt.title("Clustering: {0}".format(len(centers)))
    plt.savefig('./plots/kmeans/kmeans_' + str(k) + '.png') # save before show, otherwise the figure has already been cleared
    plt.show()
for k in [20, 30]:
    createKMeans(tsne_df, k)
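Because judging k visually is hard here, a quantitative criterion such as the silhouette score can support the choice. A sketch on synthetic 2-D blobs standing in for the t-SNE embedding (the candidate k values are illustrative, not the 20-30 used above):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# synthetic 2-D points standing in for the t-SNE embedding
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

best_k, best_score = None, -1.0
for k in [2, 3, 4, 5, 6]:
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)  # higher = tighter, better separated clusters
    print(k, round(score, 2))
    if score > best_score:
        best_k, best_score = k, score
print('best k:', best_k)
```

On the real embedding, the same loop over k in the 20-30 range would give a number to back up the visual judgment.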
Similar to k-Means clustering, the data points are very close to each other and telling the different clusters apart visually is hard. The spectral clustering algorithm assigns either only a few points or very big clusters to the class 'Netflix'.
from sklearn.cluster import SpectralClustering
def createSpectral(data, k):
    print("Number of clusters: {0}".format(k))
    spectral = SpectralClustering(n_clusters=2, affinity="nearest_neighbors") # note: n_clusters is hard-coded to 2, so k only affects the plot title
    spectral.fit(data[['x', 'y']]) # fit on the embedding columns only
    countPointsPerCluster(data, spectral, 'spectral')
    visualizeSpectralClusters(data, spectral, k)
    #visualizeClusters(data, spectral, k)
def visualizeSpectralClusters(data, model, k):
    data['cluster'] = model.labels_
    sns.scatterplot(x='x', y='y', data=data, alpha=0.03, hue='cluster')
    plt.title("Clustering: {0}".format(k))
    plt.savefig('./plots/spectral/spectral_' + str(k) + '.png') # save before show, otherwise the figure has already been cleared
    plt.show()
for k in [2, 5, 10, 20]:
    createSpectral(tsne_df, k)
The logistic regression and decision tree models appear to perform well when only taking the accuracy scores into account. However, when we compare these values to the R2 scores, we see that the predictions of the models are not reliable.
Similarly, the neural networks do not seem to be suitable for the prediction, although their R2 (35%) on the train and test datasets is better than that of the logistic regression and decision tree models. kNN reaches a similar R2 on the train dataset but performs quite badly on the test dataset (R2 of -80%).
When comparing the plots of PCA and t-SNE, we can see that t-SNE visualizes the data in a more clustered way that makes interpretation easier. Even after dimensionality reduction using PCA, 35 components (51 before PCA) are still needed for an explained variance of 95%.
It was interesting to see the effect of combining PCA with t-SNE, as interesting shapes were formed in the plots.
k-Means and spectral clustering show that many clusters exist. A reason for the high number of clusters might be, as previously stated, the relatively high number of components that the clustering algorithms had to deal with, even after dimensionality reduction. As the data points are very close to each other, interpreting these plots is quite tricky. Interestingly, while k-Means found increasingly more clusters for higher values of k, spectral clustering stuck to two clusters for all the tested k; note, however, that n_clusters is hard-coded to 2 in createSpectral, so this outcome follows from the configuration rather than from the data.
Although there is a negative correlation between titles available on Netflix and on Prime Video, the overall results suggest that the available data is not suitable for this prediction, regardless of the approach used.
The results could most likely be improved by using additional data or applying different preprocessing steps.
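One concrete preprocessing change worth trying is compensating the Prime/Netflix class imbalance noted earlier, e.g. via `class_weight='balanced'` in the logistic regression. A sketch on synthetic imbalanced data (the 80/20 split only roughly mimics the real ~11,000 vs ~3,000 ratio):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# imbalanced toy data roughly mimicking the Prime-vs-Netflix split
X, y = make_classification(n_samples=2000, weights=[0.8, 0.2], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=42)

plain = LogisticRegression(max_iter=400).fit(X_tr, y_tr)
balanced = LogisticRegression(max_iter=400, class_weight='balanced').fit(X_tr, y_tr)

# reweighting typically trades a little overall accuracy for better minority-class recall
rec_plain = recall_score(y_te, plain.predict(X_te))
rec_balanced = recall_score(y_te, balanced.predict(X_te))
print(round(rec_plain, 2), round(rec_balanced, 2))
```

On the real data, this kind of reweighting targets exactly the weak point observed earlier: the low recall for the minority "Netflix" class.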